Abstract

Two large, open source software systems are analyzed from the van- 
tage point of complex adaptive systems theory. For both systems, the full 
dependency graphs are constructed and their properties are shown to be 
consistent with the assumption of stochastic growth. In particular, the 
afferent links are distributed according to Zipf's law for both systems. 
Using the Small-World criterion for directed graphs, it is shown that, contrary
to claims in the literature, these software systems do not possess
Small-World properties. Furthermore, it is argued that the Small-World
property is not of any particular advantage in a standard layered archi- 
tecture. Finally, it is suggested that the eigenvector centrality can play 
an important role in deciding which open source software packages to use 
in mission critical applications. This comes about because knowing the 
absolute number of afferent links alone is insufficient to decide how important
a package is to the system as a whole; instead, the importance of
the linking package plays a major role as well.
1 Introduction

Scientists and engineers are increasingly turning to the metaphor of the complex
adaptive system (CAS) [1, ] in order to understand the behavior of very large 
software systems involving large numbers of components [2, 3, 4, 5, 6, 7]. A 
central tenet within the CAS paradigm is that the network of interactions be- 
tween the constituent entities, together with their intrinsic properties, largely 
determines a system's emergent behavior [8, 9, 10]. If the network of interac- 
tions is scale-free then it will combine the robustness properties of a random 
network with respect to the loss of an arbitrary node, together with the fragility
properties of a binary network with respect to the loss of one of its hubs. If a
network exhibits small-world characteristics [11], then information should flow
efficiently at both global and local levels. 

For software systems the network of interactions takes the form of a depen- 
dency graph between the components. At the system level, static dependency 
graphs play an important role in the build and package management processes; 
hence understanding their properties has important practical consequences for 
system administration. Dependency graphs also provide a finer grained view 
of the system than is usually obtained by looking at high level architectural 
diagrams. 

In this paper, the static dependency graphs for two relatively large soft- 
ware systems, Debian [12] and Maven [13] are examined. Debian is an open 
source Linux distribution comprising over 22,000 separate software packages
totaling more than 20 Gigabytes of source code. It is one of the oldest Linux 
distributions and is still employed as a server OS, not to mention its immense 
popularity among Linux enthusiasts. Maven is, in the first place, a software 
project management tool, but it is also a collection of repositories for Java soft- 
ware. It is these repositories which give Maven, as a project management tool, 
its strength, because they offer all Java developers a convenient mechanism for 
tracking package dependencies with minimal effort. 

Approaching these systems from the CAS viewpoint means attempting to 
identify and understand their emergent properties. Previous studies of the Debian
distribution and collections of Java software have concentrated upon the
degree distributions of the afferent and efferent links in the dependency graph 
[14, 2, 3, 5, 6, 15]. All studies agree that the afferent degree distribution follows 
a power law, but they obtain significantly different exponents for Debian and 
Java software, leading one to wonder if the differences are due to some funda- 
mental differences in how the software is constructed, or if they simply reflect 
the small sample size used in previous studies of Java software. The main moti- 
vation for using Maven is to have a well defined, standard set of Java software, 
representing a much larger collection of packages than used in previous stud- 
ies. Like the Debian distribution, the Maven repositories represent reproducible 
data sets upon which other researchers can cross-check the results presented 
here. 

The present work has two aims. First, it examines the issue of growth models
for describing the distributions seen in the afferent and efferent degrees in an 
effort to determine whether or not the different distributions seen in these quan- 
tities can be explained in terms of a single model. Using completely separate 
models for describing the distribution of afferent and efferent links as was the 
case in previous studies is not satisfactory as the links arise from within the 
same software development process and should therefore be correlated. 

The second aim of this paper is to look beyond the degree distribution to 
examine small world and centrality issues. An agglomeration of all of these 
properties gives indications regarding the stability, maintainability and maturity 
of the various software packages. 
1.1 Previous Work



Maillart et al. [6] measured the afferent degree distribution for the Debian sys- 
tem. Valverde et al. [14] examined Java software and measured the total degree 
distribution by treating the dependency graph as an undirected graph. Baxter 
et al. [16, 15] and Wheeldon and Counsell [4] measured both the afferent and 
efferent degree distribution for different Java packages, whereby they made a 
more fine grained distinction between the different types of dependencies than is 
being made in this paper. Concas et al. [5] also studied fine grained afferent and
efferent degree distributions related to object-oriented software quality metrics
for Smalltalk software, while Myers [2] examined these quantities for several
open source software packages. In all of these studies, the afferent degree
distribution was found to obey a power law, while the efferent degree distribution,
to the extent it was measured, was found to follow either a power law [2, 4], or
some other distribution [16, 15, 5]. Even though a power law was obtained for 
the afferent degree distribution in all studies, there was no consensus on the 
exponent, with various estimates ranging from a low of 1.4 to a high of 2.5. 

Concas et al. [5], Maillart et al. [6] and Baxter et al. [15] proposed various
growth models to explain the origins of the observed phenomena. Concas et al.
discussed random processes involving both independent and proportional growth
models as well as the Yule-type process [17] suggested by Newman [18]. They
found good agreement between the proportional growth model and the efferent
degree distribution, as well as good agreement between the predictions of the
Yule model and the distribution of afferent degrees. Maillart et al. and Baxter et
al. also discuss Yule-type processes; the latter model is a discrete growth
model, while the former takes a continuum approximation. The continuous model
of Maillart et al. was applied to the afferent degree distribution and showed
good agreement with the measurements. The discrete model of Baxter et al.
was applied to both the afferent and efferent degrees and while the agreement 
with measurements for the afferent degree was quite good, the agreement with 
the measurements for the efferent degree was not as satisfactory. 

The small-world properties have been previously measured for several open 
source software packages by Moura et al. [3] and Valverde and Sole [19]. Both of 
these studies found evidence for small-world behaviour when using the original 
Watts and Strogatz's [11] definition of a small-world graph. To apply this defini- 
tion, both studies first converted the directed dependency graphs to undirected 
graphs. 

To the best of our knowledge, eigenvector centrality has not previously been 
studied in the context of dependency graphs for large software systems. Other 
methods for ranking software based upon multiple factors are discussed by Tsat- 
saronis et al. [20]. 

1.2 Scope of Present Work 

This paper addresses shortcomings in the previous work by demonstrating that 
the afferent and efferent links in both the Maven and Debian software systems 
have the same distributions and that these distributions arise from
the same simple growth model. Towards this end, the next section discusses
the construction of the dependency graphs for both systems, while Section 3
discusses node centrality and the observed distribution of afferent and efferent
links in more detail.

Section 4 discusses the small world properties of the two systems and shows 
why the small-world effects seen in previous studies are an artifact of the con- 
version from directed to undirected graphs. 

Eigenvector centrality, another global measure of centrality better known 
as the Google rank, is examined in Section 5. The final section summarizes 
the results of this study and discusses how the CAS viewpoint can be applied to
problems in software engineering. 

2 Constructing the Dependency Graph 

A dependency graph, $G$, is defined as a pair of sets $G = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ is
a set of nodes and $\mathcal{L}$ is a set of directed links. If $n, m \in \mathcal{N}$, then $(n, m)$
denotes a directed link from $n$ to $m$; i.e., a dependency of $n$ on $m$. The nodes of
interest here are either packages in the Debian case, or jar files (Java Archives)
or classes in the Maven case. The directed links symbolize the dependencies
between the packages, jar files or classes. Define the set of afferent links for a
given node $n$ as $\mathcal{D}_{\mathcal{A}}(n) = \{(m, n) \mid (m, n) \in \mathcal{L}\}$ and the set of efferent links as
$\mathcal{D}_{\mathcal{E}}(n) = \{(n, m) \mid (n, m) \in \mathcal{L}\}$. For convenience, denote the number of afferent
links for $n$ by $q_{\mathcal{A}}(n) = |\mathcal{D}_{\mathcal{A}}(n)|$ (or $q_{\mathcal{A}}$ when speaking of an arbitrary node)
and the number of efferent links by $q_{\mathcal{E}}(n) = |\mathcal{D}_{\mathcal{E}}(n)|$. The total degree of $n$ is then

$$q(n) = q_{\mathcal{A}}(n) + q_{\mathcal{E}}(n).$$
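The definitions above can be illustrated with a minimal Python sketch. The function name `degree_counts` and the toy package names are hypothetical and chosen only for illustration; a link `(n, m)` is read, as in the text, as "n depends on m".

```python
def degree_counts(links):
    """Return (q_a, q_e): afferent and efferent link counts per node.

    `links` is an iterable of (n, m) pairs, each meaning "n depends on m".
    Duplicate links are counted once, since the links form a set.
    """
    links = set(links)
    nodes = {node for link in links for node in link}
    q_a = dict.fromkeys(nodes, 0)  # number of links pointing at the node
    q_e = dict.fromkeys(nodes, 0)  # number of links leaving the node
    for n, m in links:
        q_e[n] += 1  # (n, m) is an efferent link of n
        q_a[m] += 1  # and an afferent link of m
    return q_a, q_e


# Toy graph (hypothetical package names).
toy_links = [("app", "libgui"), ("app", "libc"), ("libgui", "libc")]
q_a, q_e = degree_counts(toy_links)
total_degree = {n: q_a[n] + q_e[n] for n in q_a}
```

In this toy graph the node with the most afferent links is the low-level library, mirroring the role that core libraries play in the graphs studied here.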

There are three graphs of interest here: the Debian dependency graph, $G_D$,
the Maven jars dependency graph, $G_J$, and the Maven class dependency graph,
$G_C$. Each of these graphs is constructed from the binary packages without
recourse to the source code.

Determining $G_D$ is straightforward. Since the Debian package management
system [21] requires the dependency information in order to install the packages,
the information is readily available in the form of a control file accompanying
each package and collated into a single Packages file for each directory of the
repository. For each package, its control file lists the packages upon which it
depends. There are five types of dependencies defined in the Debian policy manual:
Depends, Recommends, Suggests, Enhances, Pre-Depends. For the present purposes,
no distinction is made between the different types of dependencies; rather,
they are all used to build the edges of $G_D$.

Within a Debian control file it is also possible to define so-called virtual 
packages, which are logically existing packages whose functionality is provided
by some concrete package. Virtual packages are defined through the Provides 
field in the control file. For purposes of this study, virtual packages are treated 
as concrete packages. 
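As a rough illustration of how the links could be extracted from control-file text, consider the following Python sketch. It is not the tooling used in this study. It assumes the usual "Field: value" stanza layout with blank lines between packages, strips version constraints in parentheses, and treats every alternative in an "a | b" group as a separate link; that last choice is an assumption made here for simplicity. The function names and the sample stanzas are hypothetical.

```python
import re

# The five dependency fields named in the text, all treated alike.
DEP_FIELDS = ("Depends", "Recommends", "Suggests", "Enhances", "Pre-Depends")


def parse_stanzas(text):
    """Split control-file text into a list of {field: value} dicts."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, key = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key:  # continuation of previous field
                fields[key] += " " + line.strip()
            elif ":" in line:
                key, value = line.split(":", 1)
                fields[key] = value.strip()
        stanzas.append(fields)
    return stanzas


def dependency_links(text):
    """Return the set of (package, dependency) links found in the text."""
    links = set()
    for fields in parse_stanzas(text):
        pkg = fields.get("Package")
        if not pkg:
            continue
        for field in DEP_FIELDS:
            for item in re.split(r"[,|]", fields.get(field, "")):
                # Drop version constraints "(>= 1.0)" and arch lists "[i386]".
                name = re.sub(r"\(.*?\)|\[.*?\]", "", item).strip()
                if name:
                    links.add((pkg, name))
    return links


sample = """
Package: app
Depends: libc6 (>= 2.7), libgui | libtk
Suggests: app-doc

Package: libgui
Depends: libc6
Provides: gui-toolkit
"""
sample_links = dependency_links(sample)
```

Names that occur only as dependency targets, including virtual package names, simply become nodes of the graph, which matches the decision above to treat virtual packages as concrete packages.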

Determining $G_J$ and $G_C$ is more involved. The main repositories are housed
at maven.org, codehouse.org and javax.org. Together these repositories contain
some 22,000 jar files from more than 4,000 separate projects. Although Maven
defines a project management file (POM) in XML format which can be used
to construct $G_J$ in a manner similar to how $G_D$ was constructed, it is more
instructive to first construct $G_C$ directly, then use the information therein,
combined with the POM file, to construct $G_J$. Towards this end, the jar files
for each project are opened and the binary class files are parsed to determine
their dependencies on other classes. Again, in software engineering one would
make a distinction between those dependencies which appear on the class's public
interface, those which are private to the class and those which are local to a
class method [16, 22, 5]. For the purposes of the present study the technical
differences between these different types of dependencies are ignored.

For readers interested in confirming the results presented here, the Debian 
distribution used in this study was version 5.0.0, also known as Lenny. Its first 
official release was on February 14, 2009. (Actually, this study was started using 
a beta version of the Lenny release, but the results were reconfirmed following 
the first official release.) The Maven repositories are in a more continual state 
of flux and the data used here was taken during the first week of January 2009. 

3 Degree Centrality 

The degree centrality of node n is defined as the size of the set of links starting 
or ending at n. It plays an important role in the theory of random networks 
[9]. (Throughout this paper the terms "graph" and "network" will be used 
interchangeably.) When the distribution of $q(n)$ over all $n \in G$ follows a power
law, the network's tolerance to random failures is increased. For directed
networks, the distributions of afferent and efferent links need to be treated separately.

3.1 Afferent Links 

The distribution of afferent links found in this study is depicted in Fig. 1 for 
Debian and Fig. 2 for Maven. Throughout this paper the complementary
cumulative distribution function, $P(X > x)$, is measured in lieu of the probability
mass distribution, $p(x)$, as the former provides more accurate estimates of the
distribution's parameters when the data is noisy [23, 18, 24]. The first point to
notice is that for all $G$, $p(q_{\mathcal{A}}) \sim q_{\mathcal{A}}^{-\alpha}$, with $\alpha \approx 2$ ($P(Q_{\mathcal{A}} > q_{\mathcal{A}}) \sim q_{\mathcal{A}}^{-\alpha+1}$). Such
a distribution generally goes by the name of Zipf 's law and has been previously 
found to hold for a number of different natural and anthropological phenom- 
ena [18]. In fact Maillart et al. [6] have previously demonstrated that this 
relationship holds for the Debian distribution. 

What is new in these results is that Zipf's law holds for Maven as well. 
Using the MLE (maximum likelihood estimate) [23, 24], an accurate estimate of
$\alpha$ can be obtained, yielding $\alpha = 2.0 \pm 0.1$, which holds for both the Debian and
Maven systems. Previous studies on a number of different Java software systems have
found exponents ranging from a low of $\alpha \approx 1.4$ to a high of $\alpha \approx 2.5$ [14, 2, 4,
16, 5]. The discrepancy between past results and the current experiments is
due to two factors: 1) previous studies limited the types of dependencies they
investigated, and 2) previous studies examined only single Java projects, not
the broad collection available under Maven. Limiting the dependency counting
to those on the public interfaces suppresses the distribution at large values of
$q_{\mathcal{A}}$, since the majority of dependencies are hidden as local variables inside the
bodies of class methods. Finding the same exponent for Maven and Debian is
reassuring as it indicates that similar mechanisms are involved in the engineering 
processes through which the dependency graphs arise. 
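To make the estimation step concrete, the following Python sketch applies the standard continuous-variable maximum likelihood estimator for a power-law exponent, $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{min})$, to synthetic data. It is a sketch under stated assumptions rather than the exact procedure of [23, 24]: the lower cutoff `x_min`, the function name `powerlaw_mle` and the synthetic sample are all introduced here for illustration, and degree data is discrete, so the continuous formula is only an approximation for it.

```python
import math
import random


def powerlaw_mle(samples, x_min=1.0):
    """Continuous MLE for alpha in p(x) ~ x**(-alpha), using x >= x_min.

    Returns (alpha_hat, standard_error), with the usual large-sample
    standard error (alpha_hat - 1) / sqrt(n).
    """
    tail = [x for x in samples if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    return alpha_hat, (alpha_hat - 1.0) / math.sqrt(n)


# Synthetic check: draw from a pure power law with alpha = 2 by inverse
# transform sampling, x = x_min * (1 - u)**(-1 / (alpha - 1)).
rng = random.Random(0)
true_alpha = 2.0
synthetic = [(1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
             for _ in range(50_000)]
alpha_hat, stderr = powerlaw_mle(synthetic)
```

On real degree data the choice of `x_min` matters, because the power law typically holds only in the tail of the distribution.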



3.2 A Yule Process 

The exact mechanisms for generating Zipf's law behavior are still under debate [18,
25, 26]. Some suggestions include various Yule processes [17, 27], entropy
maximization [28] and highly optimized tolerance [29]. For graphs, Barabasi and
Albert [8] defined a variant of Simon's Yule process [17], called preferential
attachment, which leads to power law behavior in the distribution of the afferent
and efferent links, and Bollobas et al. [30] extended their model to directed
graphs. Although preferential attachment is the most popular model, other
discrete growth models similarly lead to power law behavior in the afferent
link distribution [31, 32]. On the other hand, Maillart et al. [6] previously
demonstrated that a continuous formulation of the Yule process offered a good
explanation for the distribution $p(q_{\mathcal{A}})$. Here, the aim is to provide
further support for the latter hypothesis by showing that it can explain the
distribution $p(q_{\mathcal{E}})$ as well.

The essence of all Yule processes is the Gibrat principle [18, 33]: the
probability that an entity will experience an increase in the value of one of its properties
in the next time step is proportional to the current value of that property. In
other words, the rich get richer. This principle leads to a stochastic formulation
for the growth, the details of which vary with the particular model. To start
with, let $X = q_{\mathcal{A}}$; the growth in $X$ then varies over time according to the
stochastic equation:



$$dX = \mu X \, dt + \sigma X \, dW_t, \qquad (1)$$

where $\mu$ and $\sigma$ are constants and $W_t$ is a standard Wiener process, meaning
$W_0 = 0$, $W_t$ is continuous, and $W_{t+\tau} - W_t \sim N(0, \tau)$, $\forall t, \tau > 0$. Although $X$
is in the present case discrete, eq. 1 is nevertheless a good approximation when
$X$ is large. With the help of the Ito Lemma, eq. 1 can be readily integrated to
obtain the probability mass distribution:

$$p(x; T) = \frac{1}{x \sqrt{2 \pi \sigma^2 T}} \exp\left( -\frac{\left[ \ln(x) + (1 - 2\mu/\sigma^2) \sigma^2 T / 2 \right]^2}{2 \sigma^2 T} \right), \qquad (2)$$

where $T$ is the time period over which the stochastic growth occurs and the
initial value of $X(t)$ is taken to be $X(0) = 1$.
Note that individual sets of afferent links, $\mathcal{D}_{\mathcal{A}}(n)$, have disparate histories,
i.e., they have been growing for different lengths of time, with the oldest packages
more than 15 years old and the youngest not more than a year; hence, to
compare with the data, one needs to average over the distribution of $T$ [8, 34, 33].
Although it is possible in principle to track down the lifetime of each package,
and thereby obtain the true distribution of $T$, the effort to do so would be
prodigious. Instead one can make progress by assuming all possible values of
$T$ up to some large value, $T = \mathcal{T}$, which is the lifetime of the software systems
themselves, are equally likely:



Eq. 3 holds for $\mu < \sigma^2/2$ and large $\mathcal{T}$. Under the latter condition, the first term
in eq. 3 dominates and all higher order terms in $\mathcal{T}$ can be neglected.

When $\mu \ll \sigma^2/2$, the growth is dominated by the stochastic fluctuations and
$p(x) \sim x^{-2}$, in good agreement with the measurements.
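The averaging step can be made concrete with a short calculation. The following is only a sketch of the leading-order behavior and is not a reproduction of eq. 3. It assumes a uniform weight $1/\mathcal{T}$ on lifetimes $0 < T \le \mathcal{T}$, where $\mathcal{T}$ denotes the lifetime of the software system, restricts attention to $x > 1$, and extends the upper limit of the integral to infinity, which is reasonable for large $\mathcal{T}$ when $\mu < \sigma^2/2$. The shorthand $a$ is introduced here for convenience, and the closed form uses the standard integral $\int_0^\infty T^{-1/2} e^{-\beta/T - cT} \, dT = \sqrt{\pi/c} \, e^{-2\sqrt{\beta c}}$.

```latex
\begin{align*}
p(x) &= \frac{1}{\mathcal{T}} \int_0^{\mathcal{T}} p(x;T)\,dT
      \approx \frac{1}{\mathcal{T}\,x} \int_0^{\infty}
        \frac{1}{\sqrt{2\pi\sigma^2 T}}
        \exp\!\left(-\frac{[\ln x - aT]^2}{2\sigma^2 T}\right) dT,
      \qquad a = \mu - \frac{\sigma^2}{2} < 0, \\
     &= \frac{1}{\mathcal{T}\,x\,|a|}\,
        \exp\!\left(\frac{(a - |a|)\ln x}{\sigma^2}\right)
      = \frac{x^{-2 + 2\mu/\sigma^2}}{\mathcal{T}\,(\sigma^2/2 - \mu)} .
\end{align*}
```

In this approximation the exponent is $-2 + 2\mu/\sigma^2$, which tends to $-2$ when $\mu \ll \sigma^2/2$, consistent with the Zipf behavior described above.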

3.3 Efferent Links 

The distribution of efferent links, $p(q_{\mathcal{E}})$, is depicted in Fig. 3 for Debian and
Fig. 4 for Maven. Figs. 1-4 demonstrate clearly that the distribution of the 
number of afferent links is not commensurate with the distribution of the number 
of efferent links. The number of efferent links follows a lognormal distribution 
in agreement with [5, 15]. 

At first glance this might cast doubt on the explanation of stochastic growth 
presented in the previous section; however, an examination of the assumptions 
leading from eq. 2 to eq. 3 reveals a difference between the two. The set of 
afferent links, $\mathcal{D}_{\mathcal{A}}(n)$, for a given node, $n$, grows when another node, $m$, links to
it. In software engineering, this linkage occurs if the package or class represented
by $n$ has services required by the package or class represented by $m$. Over time,
as the size of the Debian or Maven collection increases, $\mathcal{D}_{\mathcal{A}}(n)$ will continue to grow
as new packages are built using the services of existing packages. The set of
efferent links, $\mathcal{D}_{\mathcal{E}}(n)$, on the other hand, expands only when the responsibilities
of package $n$ expand. In software engineering, it is considered poor practice to
extensively change a class's responsibilities once it has become well established in
the community. The preferred approach is to create a new class which extends
the older class. This "open to extension, closed to change" philosophy [35]
implies that the set $\mathcal{D}_{\mathcal{E}}(n)$ will grow only for the relatively short time needed for $n$
to become mature and widely accepted. Assuming that this settling time, to be
denoted by $T_s$, is constant, independent of any particular node, and relatively
short compared with the time the newer nodes have been in existence, then the
distribution, $p(q_{\mathcal{E}})$, would be expected to have the form given in eq. 2 with $T_s$
replacing $T$.

Other models for the distribution of links in a dependency graph, such as 
some form of preferential attachment [31, 32, 30], highly optimized tolerance 
[29], local optimization [14] and entropy maximization [28] predict the same 
distribution for both the afferent and efferent links. Hence, at least in the 
field of software engineering, the model presented here appears to be the most 
suitable in the sense that it is able to accommodate different distributions in the 
afferent and efferent links starting from a common mechanism for their growth. 



4 Small Worlds

$G_D$, $G_J$ and $G_C$ are all sparse graphs. If they were completely random, then
one would expect to observe a low degree of local clustering coupled with a
small average path length between nodes. A so-called small-world graph [11],
on the other hand, combines a high degree of local clustering with a small
average path length between nodes. Watts and Strogatz's original definition
of a small-world network considered only connected graphs with undirected
links, whereas all the graphs studied here contain directed links. Furthermore,
the graphs are disconnected in the sense that it is not possible to start at an
arbitrary node and visit any other arbitrary node while traversing the links
in their proper direction. (If the links were undirected, then $G_J$ and $G_C$ would
be connected. Directed graphs fulfilling this condition are often referred to as
weakly connected.)

An adequate definition for a small-world signature in the case of directed
graphs was given by Latora and Marchiori [36]. In their paper they first defined
the efficiency of a graph as:

$$E(G) = \frac{1}{|\mathcal{N}| (|\mathcal{N}| - 1)} \sum_{n \neq m \in \mathcal{N}} \frac{1}{d_{nm}}, \qquad (4)$$

where $d_{nm}$ is the directed distance from node $n$ to node $m$ and, in general, $d_{nm} \neq d_{mn}$.
As links in the networks studied here are not weighted, $d_{nm}$ is simply the total
number of links traversed in their proper direction while walking from $n$ to
$m$. If there is no directed path from $n$ to $m$, then by convention $d_{nm} = \infty$.
(Note that $G$ is a directed acyclic graph, meaning it does not contain any circular
dependencies, including self-dependencies.) $E(G)$ is normalized, $E(G) \in [0, 1]$,
with values near 1 indicating highly efficient information flow, while values near
0 indicate that information spreads slowly through the network.

Having defined $E(G)$ to measure how efficiently information flows through
the entire graph, a similar quantity to measure the efficiency with which
information flows locally within subgraphs can be defined as $E(G(n))$, where $G(n)$
is the subgraph of $G$ consisting of all the nearest neighbors of $n$ together with
their mutual links, but not $n$ itself, nor any links to $n$.
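The two efficiency measures can be sketched in Python as follows. This is an illustrative implementation of the Latora-Marchiori efficiency for unweighted directed graphs using breadth-first search, not the code used in this study. It assumes that unreachable pairs contribute zero (since $1/\infty = 0$) and that the "nearest neighbors" of a node are all nodes joined to it by a link in either direction; the latter is an assumption, as the text does not spell out the convention. The function names and toy graphs are hypothetical.

```python
from collections import deque


def global_efficiency(nodes, links):
    """Average of 1/d_nm over all ordered pairs n != m (directed BFS)."""
    nodes = list(nodes)
    size = len(nodes)
    if size < 2:
        return 0.0
    succ = {n: [] for n in nodes}
    for n, m in links:
        if n in succ and m in succ and n != m:
            succ[n].append(m)
    inverse_sum = 0.0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:  # first visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    inverse_sum += 1.0 / dist[v]
                    queue.append(v)
    return inverse_sum / (size * (size - 1))


def local_efficiency(nodes, links):
    """Mean over n of E(G(n)), G(n) being the neighbors of n without n."""
    nodes = list(nodes)
    links = set(links)
    neighbors = {n: set() for n in nodes}
    for n, m in links:
        if n != m:
            neighbors[n].add(m)
            neighbors[m].add(n)
    total = 0.0
    for n in nodes:
        sub = neighbors[n]
        sub_links = [(a, b) for a, b in links if a in sub and b in sub]
        total += global_efficiency(sub, sub_links)
    return total / len(nodes)


# A directed chain a -> b -> c, and the same chain with links made undirected.
chain_nodes = ["a", "b", "c"]
chain_links = [("a", "b"), ("b", "c")]
undirected_links = chain_links + [(m, n) for n, m in chain_links]
e_directed = global_efficiency(chain_nodes, chain_links)
e_undirected = global_efficiency(chain_nodes, undirected_links)

# A directed triangle a -> b, a -> c, b -> c.
triangle_links = [("a", "b"), ("a", "c"), ("b", "c")]
e_local_triangle = local_efficiency(chain_nodes, triangle_links)
```

Even in this three-node chain, discarding the link direction doubles the global efficiency, which gives a feel for how converting a directed dependency graph to an undirected one can overstate how efficiently information flows. The local efficiency routine rescans all links for every node, which is acceptable for a toy example but would need an indexed neighbor lookup for graphs of the size studied here.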

Small-world graphs are then defined as those having large values of both
global efficiency, $E(G)$, and local efficiency, $\langle E(G(n)) \rangle$. A random graph, on the other
hand, would be expected to exhibit relatively large values of $E(G)$ but relatively
small values of $\langle E(G(n)) \rangle$, while a well ordered graph should exhibit small values
of $E(G)$ and large values of $\langle E(G(n)) \rangle$. Note: just as in the original Watts and
Strogatz definition, the Latora and Marchiori definition does not provide exact
definitions for "small" and "large". Generally, their paper considers values less
than 0.1 as "small" and those greater than 0.25 as "large".

Table 1 lists the values of the local and global efficiency found for the
dependency graphs, $G_D$, $G_J$ and $G_C$. As can be seen, all of the graphs have large
local efficiency, combined with small global efficiencies; hence, these graphs do
not fulfil the small-world conditions. This result contrasts with previous studies
claiming to have uncovered the small-world signature in software dependency
graphs [3, 19].

The problem arises in that previous studies relied on the original Watts- 
Strogatz definition of a small world in terms of undirected graphs rather than 
the Latora-Marchiori definition for directed graphs. Simply treating directed 
links as undirected links gives a false impression of the efficiency with which 
information flows through the network. 

To further understand the previous point, calculate the Pearson correlation 
coefficient between the number of afferent and efferent links at the end points 
of each link in the network: 



where $\sigma_{q_\alpha}$ is the standard deviation of $q_\alpha$ and $\alpha, \beta \in \{\mathcal{A}, \mathcal{E}\}$. Eq. 5 reduces to
Newman's assortative mixing coefficient [37] when $q = q_{\mathcal{A}} + q_{\mathcal{E}}$. $r(G) \in [-1, 1]$,
with $r(G) = 1$ indicating a perfect positive correlation (assortative mixing [37]),
i.e., nodes with a given value of $q(n)$ connect only to other nodes with the same
value $q(n)$, and $r(G) = -1$ indicating perfect negative correlation (disassortative
mixing [37]), i.e., nodes with small $q(n)$ connect only to nodes with large $q(n)$
or vice versa.
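One plausible reading of the mixing coefficient is sketched below in Python: for every link $(n, m)$, pair a degree of the source node $n$ with a degree of the target node $m$, and compute the ordinary Pearson correlation over all links. This pairing convention, the function name `mixing_coefficient` and the toy graph are assumptions made for illustration and may differ in detail from eq. 5.

```python
import math


def mixing_coefficient(links, q_source, q_target):
    """Pearson correlation, over links (n, m), of q_source[n] and q_target[m]."""
    links = list(links)
    xs = [q_source[n] for n, _ in links]
    ys = [q_target[m] for _, m in links]
    count = len(links)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / count
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / count)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / count)
    if std_x == 0.0 or std_y == 0.0:
        return float("nan")  # correlation is undefined without variation
    return cov / (std_x * std_y)


# Toy graph (hypothetical nodes); a link (n, m) means "n depends on m".
toy_links = [("a", "x"), ("b", "x"), ("c", "x"), ("c", "y"), ("d", "y")]
toy_nodes = {node for link in toy_links for node in link}
q_a = {n: sum(1 for _, m in toy_links if m == n) for n in toy_nodes}
q_e = {n: sum(1 for s, _ in toy_links if s == n) for n in toy_nodes}

# Efferent degree at the source against afferent degree at the target.
r_ea = mixing_coefficient(toy_links, q_e, q_a)
```

Passing other combinations of `q_a` and `q_e` as the source and target degrees yields the remaining three coefficients.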

Table 1 lists the four mixing coefficients for the networks studied here. The
main trend to notice in the data is the uniformly, relatively large values of $r_{\mathcal{EA}}(G)$
and the uniformly, relatively small values of $r_{\mathcal{AA}}$ and $r_{\mathcal{AE}}$. These results indicate
that the directed links have a disassortative preference with nodes of low efferent
degree connecting to nodes of high afferent degree. In such networks information
flow is inhibited since nodes with high afferent degree tend to have low efferent
degree, thus inhibiting the efficient spread of information at the global level.
5 Google Rank



The degree centrality discussed in section 3 provides clues about how important 
a given node is to the network in that it measures how many other nodes depend 
upon it. However, this is not the full picture, since a node with a small $q_{\mathcal{A}}(n)$ can
be equally important to the network as a whole if nodes with large $q_{\mathcal{A}}(n)$ depend
upon it. To understand this phenomenon one needs to quantify the importance
of a node in a graph based on the importance of the nodes linking to it. This
problem was famously solved by Brin and Page [38]. Their solution rests on a
variation of the eigenvector centrality measure, which is defined in terms of the 
principal eigenvector of the adjacency matrix, A. 

The entries of the adjacency matrix in normalized form are defined as $A_{nm} = 1/q_{\mathcal{E}}(n)$
if $(n, m) \in \mathcal{L}$ and 0 otherwise. In the present case some nodes have no
efferent links, while others have no afferent links, implying $A$ is singular;
therefore there is no guarantee that all the components of the principal eigenvector
will be non-negative, which is a necessary prerequisite for using components
of the eigenvector as a centrality measure. To overcome the problem of
non-existent efferent links, add to $A$ the matrix $B$, defined as: $B_{nm} = 1/|\mathcal{N}|$ if
$q_{\mathcal{E}}(n) = 0$, and 0 otherwise. To overcome the problem of no afferent links, add
to $A$ the matrix $C$ defined as: $C_{nm} = 1/|\mathcal{N}|$ for all $n, m$. Since all the $G(\mathcal{N}, \mathcal{L})$
constructed here are sparse, $1/|\mathcal{N}| \ll 1/q_{\mathcal{E}}(n)$, $\forall n$; thus, the contributions
from $B$ and $C$ to $A$ are small and should not significantly distort the centrality
measure. The new matrix is:

$$P = \gamma (A + B) + (1 - \gamma) C, \qquad (6)$$

and the eigenvalue equation now reads: 

$$R = P^T R. \qquad (7)$$

In this form, eqs. 6 and 7, define the Google page rank. In order to understand 
the meaning of eq. 7, one can expand it: 

$$R(n) = \frac{1 - \gamma}{|\mathcal{N}|} \sum_{m} R(m) + \gamma \sum_{(m,n) \in \mathcal{D}_{\mathcal{A}}(n)} \frac{R(m)}{q_{\mathcal{E}}(m)} + \frac{\gamma}{|\mathcal{N}|} \sum_{m : q_{\mathcal{E}}(m) = 0} R(m), \qquad (8)$$

showing that $\gamma$ has the desired property of weighting the rank of each node by
the rank of the nodes which link to it. The last term in eq. 8 adds a small weight
from all nodes with no efferent links, while the first term gives a small weight
to nodes with no afferent links. The properties of $P$ are well known [39]: it is a
large, sparse, column stochastic matrix whose dominant eigenvalue is equal to 1;
the eigenvector corresponding to the dominant eigenvalue has only non-negative
elements; and the second largest eigenvalue is $\gamma$. As long as $\gamma \gg 1/|\mathcal{N}|$, the
exact value does not influence the results; however, if a power method is used to
solve for the principal eigenvector, the rate of convergence is equal to $\gamma$, meaning
the power method will fail to converge as $\gamma \to 1$. In this study the value $\gamma = 0.9$
is used.
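The rank vector can be computed without ever forming $P$ explicitly. The Python sketch below applies the power method to the component form of the eigenvalue equation, with the contributions of $A$, $B$ and $C$ accumulated separately. It is an illustrative implementation under stated assumptions, not the code used in this study: the function name `page_rank`, the convergence tolerance, the iteration cap and the toy package names are all introduced here.

```python
def page_rank(nodes, links, gamma=0.9, tol=1e-12, max_iter=1000):
    """Power iteration for R = P^T R with P = gamma*(A + B) + (1 - gamma)*C."""
    nodes = list(nodes)
    size = len(nodes)
    succ = {n: set() for n in nodes}
    for n, m in links:
        succ[n].add(m)  # n depends on m, so n passes rank on to m
    rank = {n: 1.0 / size for n in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without efferent links (the B term).
        dangling = sum(rank[n] for n in nodes if not succ[n])
        # Uniform share from the C term plus the share from the B term.
        base = ((1.0 - gamma) * sum(rank.values()) + gamma * dangling) / size
        new_rank = {n: base for n in nodes}
        for n in nodes:
            if succ[n]:
                share = gamma * rank[n] / len(succ[n])  # the A term
                for m in succ[n]:
                    new_rank[m] += share
        change = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph: six applications depend on libc, two of them also on libfoo,
# and libc depends on libc-doc (hypothetical names).
apps = [f"app{i}" for i in range(1, 7)]
toy_nodes = apps + ["libc", "libfoo", "libc-doc"]
toy_links = [(app, "libc") for app in apps]
toy_links += [("app1", "libfoo"), ("app2", "libfoo"), ("libc", "libc-doc")]
ranks = page_rank(toy_nodes, toy_links)
```

In this toy graph `libc-doc` has a single afferent link and `libfoo` has two, yet `libc-doc` receives the higher rank because its one link comes from the heavily used `libc`.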
Fig. 5 depicts the average rank as a function of the number of afferent links
for the various graphs considered here. As can be seen there is a clear trend, with
large numbers of afferent links correlating with high average rank. However, the
average value glosses over significant details of the graph structure, as can be
seen by examining Table 2. The third-ranked Debian package has a
mere 8 afferent links, while the 7th-ranked package has 5,620. Similarly, the
6th-ranked jar file in the Maven repositories has only 2 afferent links, while
the file ranked 7th has 1,092. The reason a node with a very small number
of afferent links can be ranked so highly is that other, more highly ranked nodes
depend upon it.

6 Summary and Conclusions 

This study has demonstrated three points. Firstly, the distributions of afferent
and efferent links for dependency graphs of large software systems are similar,
independent of the details of how the system is constructed. For the afferent
links, this distribution obeys Zipf's Law. For the efferent links, the distribution
is lognormal. Both of these distributions can be explained in terms of the same
stochastic growth process, with the differences in the final form of the
distribution explainable in terms of the different time scales over which the growth takes
place. From these results one can hypothesize that the dependency graph of
any sufficiently large software system will, from a complexity viewpoint, have
the same properties.

Secondly, this study has shown that the small-world metaphor does not
apply to large software systems, because the global efficiency is too small. This
result contrasts with previous work, which used the small-world definition for
undirected networks, but can be understood by considering the high-level
architecture of the Debian system. From the software engineering point of view,
the Debian distribution is well structured, employing a layered architecture with
the Linux operating system on the bottom, and Gnome/GTK+ (or KDE/Qt)
applications on the top. In principle, developers should target their applications
at a particular layer, building upon services in the layer below, while offering
services to the layer above. In practice, applications in a given layer will often
use services from any or all of the lower lying layers, but never from a higher
layer. The long-range links typical of small-world networks would, in a layered
architecture, proceed from higher layers down to the lower layers. Without the
reverse couplings, information flows upward in the stack, but not downward,
thus stifling the global efficiency and the classical small-world effects. It would
be an interesting extension to develop a signature for layered networks similar 
in spirit to the signature for small world networks. 

Thirdly, the importance of a node in a dependency graph depends not only 
on the number of afferent links, but also on the importance of the node from 
which the afferent link arises. For example, the glibc-doc package has only 
8 afferent links; however, it is ranked 3rd amongst all Debian packages because 
the number one ranked package, libc6, links to it. 

To further understand how CAS considerations can play a more active role 
in software engineering, consider the questions of stability, maintainability and 
maturity raised in the introduction. In the context of object-oriented software 
development, an important software quality metric is the instability, which is 
defined as the ratio of the number of efferent links to the sum of efferent and 
afferent links for a given object [40]. Within the CAS paradigm, one would apply 
this metric to the system as a whole, using the methodology for determining 
the link distribution outlined above. In the notation used here, the instability 
of node $n$ would be written as $I(n) = q_E/(q_E + q_A)$. This would lead to a 
measure of instability for each open source package and provide important clues 
about the risk of using any given package in one's own project. Large values of 
$I(n)$ indicate an unstable package which is likely to change more often than a 
package with a relatively smaller value of $I(n)$. 
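
As a minimal sketch, the instability of every node, i.e. the number of efferent links divided by the sum of efferent and afferent links, can be computed directly from a dependency graph. The graph below and its package names are hypothetical, and returning None for a node without any links is a choice made here, since the ratio is undefined in that case.

```python
def instability(graph):
    """Return {node: efferent / (efferent + afferent)} for a dependency graph.

    graph maps each node to the list of nodes it depends on (efferent links).
    """
    afferent = {v: 0 for v in graph}
    for u, targets in graph.items():
        for v in targets:
            afferent[v] += 1
    result = {}
    for v, targets in graph.items():
        q_e = len(targets)
        q_a = afferent[v]
        total = q_e + q_a
        # The ratio is undefined for an isolated node.
        result[v] = q_e / total if total else None
    return result


# Hypothetical dependency graph; an edge u -> v means "u depends on v".
dependencies = {
    "app1": ["toolkit", "libc"],
    "app2": ["toolkit"],
    "toolkit": ["libc"],
    "libc": [],
}

scores = instability(dependencies)
for name, value in scores.items():
    print(name, value)
```

Here the applications, which only depend on others, have an instability of 1, while the base library, which is only depended upon, has an instability of 0.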

The maintainability and maturity of a software package is not just a question 
of its age or the number of previous releases, but also a question of its 
acceptance. If it is not being used by other projects, then it will sooner or later 
fade away. Determining which packages are more likely to be maintained and 
updated over time is a key factor in deciding whether or not to use open source 
software in mission-critical settings. Tsatsaronis et al. [20] tackle this question 
by creating a model for open source software repositories containing 19 common 
parameters for each package. While these parameters are very good at judging 
the current health of an open source project, they do not provide enough insight 
into the question of whether or not a package is likely to be maintained in the 
long run. Successful projects will always look good according to the criteria of 
Tsatsaronis et al. as long as the project's originator is still running the project; 
however, what happens when the project's originator, for whatever reason, is no 
longer able to coordinate the project? How likely is it that the software will be 
maintained? The results shown here suggest that the answer to this question 
is to expand the parameter list of Tsatsaronis et al. to include the Google Rank of the 
package in the network of all open-source software. The more central a given 
package is to the system as a whole, the more likely it is that a talented 
developer will step forward to maintain the package once its originator has left. 

Finally, as discussed above, the layered architectures examined here do not 
exhibit small-world properties for good reasons; however, it is an open question 
whether or not other architectural patterns might benefit from small-world 
behavior. In particular, when using a Service Oriented Architecture (SOA) to 
establish an ecosystem of services [41], the small-world property may have 
advantages and may arise naturally. The important concept to consider is how 
information should flow in the system and whether information should flow as 
freely over long distances as it does over short distances. 

Tables 

Table 1: The global and local efficiencies, $E(G)$ and $\langle E(G(n)) \rangle$ respectively, 
along with the Pearson coefficients for the links; e.g., $r_{EA}(G)$ measures the 
correlation of $q_E$ and $q_A$ between connected nodes. 
-0.00043 
0.00072 
-0.020 

$G_C$   0.024   0.33   -0.021   -0.28   -0.033   -0.095 

Table 2: Rank and number of links. 

Class 
123 
AbstractStringBuilder 

Figures 

Figure 1 Complementary cumulative distribution function, $P(Q_A > q_A)$, for $q_A$ 
in $G_D$. The solid line, $P(Q_A > q_A) \propto 1/q_A$, is a guide to the eye. 

Figure 2 Complementary cumulative distribution function, $P(Q_A > q_A)$, for $q_A$ 
in $G_C$ (circles) and $G_J$ (squares). The solid line, $P(Q_A > q_A) \propto 1/q_A$, is a 
guide to the eye. 

Figure 3 Complementary cumulative distribution function, $P(Q_E > q_E)$, for $q_E$ 
in $G_D$. The solid line is a best fit to the complementary error function (erfc) 
form in $\ln q_E$ which stems from eq. 2. 

Figure 4 Complementary cumulative distribution function, $P(Q_E > q_E)$, for $q_E$ 
in $G_C$ (circles) and $G_J$ (squares). The solid lines are best fits to the 
complementary error function (erfc) form in $\ln q_E$ which stems from eq. 2. 

Figure 5 Average rank versus $q_A$. The circles represent $G_C$, the triangles $G_J$, 
and the squares $G_D$. 



[Figure plots not reproduced; the only surviving axis label is "Efferent Links", with logarithmic tick marks.] 



